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Abstract 

The use of quadratic forms of the empirical process for the two-sample problem in the context of functional data is 
considered. The convergence of the family of statistics proposed to a Gaussian limit is established under metric entropy 
conditions for smooth functional data. The applicability of the proposed methodology is evaluated in examples. 
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1 Introduction 


Functional data analysis has had a very important growth in the last 20 years, and has found applications in many different 
areas, especially since the first edition of the book by Ramsay and Silverman in 1997. More recent contributions to the 
field can be found in Bosq j2000| l, [Ramsay and Silverman] ( 2002| 2005) 1, [Ferraty and Vieu ( 2006[ >, Ferraty| (2011 > and 
Horvath and Kokoszka| ( |2012| , where examples of diverse applications can also be found. 

In the analysis of functional data a frequent problem is that of deciding if two samples of functions come from the same 
population. Let Xi(t), • ■ ■ ,X m (t) be an i.i.d. sample of real valued curves defined on some interval J. Denote by Jf(X) 
the probability law producing these curves. Likewise, let Y\ (f), • ■ ■ . Y„(t ), be another i.i.d. sample of curves, independent 
of the X sample and also defined on J, with probability law ,'/ J (Y). In the two-sample problem, we wish to test the null 
hypothesis, Hq: ££ (X) = 2z?(T) against the general alternative 2?f(X) ^ ££ (7). 

This problem has been considered from several viewpoints. |Munoz Maldonado et al.[ ( |2002j l define a similarity index 
for curves based on the sample correlation coefficient of vectors obtained from evaluating the registered curves on a 
common grid and use permutation tests. Hall and Van Keilegom ( |2007) > study the effect of smoothing the functional data 
on the power of tests for the two-sample problem and propose bootstrap statistics that generalize the two-sample Cramer- 
von Mises methodology to the functional data setting. Benko et al. ( 2009[ ) consider the problem from the point of view 
of functional principal components. To test the differences between two samples of functions their respective Karhunen- 
Loeve expansions are considered. In particular, they develop a bootstrap test for testing common principal components. 
Horvath and Kokoszka (2009) consider the two sample problem for regressions of the form Y J = ij/'X' + e J .j = 1,2, 
where the X 1 are function over a compact subset of a Euclidean space, the responses Y J can either be functions or 
scalars and the i// 7 are linear operators over a function space which take either values in the same function space or 
scalar values. Using expansions with respect to the functional principal components, they develop a test for the equality 
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of the operators y/ J 


Pena 


(2012) presents several proposals for the functional two-sample problem, based mainly on 


permutation tests. Paparoditis and Sapatinas ( 2014j ) develop a general testing methodology for functional data based on 
bootstrap techniques, which is applicable to different testing problems and test statistics, including the comparison of 
mean or covariance functions. __ 

Ch. 5) consider samples Xf , i = 1,2,... 


In their book, Horvath and Kokoszka 


that they satisfy the models 


(2012 




7 = 1,2 


Hj,j =1,2. Under the assumption 

( 1 ) 


they propose two tests for the hypothesis Hq : /I 1 = /J, 2 in L 2 against the alternative that Hq is false. The first method is 
based on the sample estimators for the mean functions, while the second is based on the functional principal component 
expansions. We describe the latter in detail, since it is related to the method proposed in the present work. 

Assume that the two samples are independent, the noises are centered, ej ,i = 1, ...,tij are i.i.d. for fixed j and are 
independent for different j, although they are not assumed to have the same distribution in this case. Also £11 e (|| 4 < °° 
for j = 1,2. Consider the operator Z = (1 — 0)C 1 + 0C 2 ,0 < 0,< 1, where C J is the covariance operator corresponding 
to A- 7 , j =1,2. Assume the eigenvalues of Z satisfy 


Tt > t 2 > • • ■ > T d > x d +\ 


( 2 ) 


for some large d and let @i.... ,(p r i be the corresponding eigenfunctions. Let Z nii „ 2 be the sample version of Z and 
let 0i,...,be the eigenfunctions for this operator. Let A- 7 = (1 /w/)E"=i-X/(f), a ,■ = (A 1 — A 2 ,0,-),1 < i < d, and 
a = . ,dd) T ■ Assume that 

--- >6, for some 0 < 0 < 1. (3) 

m +H2 

Then, under all these conditions Horvath and Kokoszka prove that 


n\ri2 
n i +n 2 


i/ 2 „ 

a 


N d (0,Q) 


(4) 


where the limit is a (7-dimensional centered normal distribution with diagonal covariance matrix satisfying Q(i,i) = T,. In 
consequence they propose the statistics 
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( 6 ) 


where N\,...,Nd are independent Gaussian standard random variables. 

In this work a family of statistics for the two-sample problem on functional data is studied. It is a family of quadratic 
forms associated to dot products of functions of the samples with a finite number of adequately chosen functions. Details 
will be given in the next section. This family includes T 1 as a special case. 

As examples of applications of the family of statistics proposed here to real data, some problems in Oceanography 
are considered. The stochastic approach to the analysis of ocean waves originated in the 1950’s with the work of Pierson 
( 1955} ) and Longuet-Higgins ( |1956 1957[ ). This approach considers ocean waves as a realization of a random process, 
frequently a centered stationary Gaussian processes, and this point of view has permitted the analysis of many important 
features of waves. An account of this theory can be found in Ochi ( 1998) ). 

The assumption of stationarity permits the use of spectral analysis techniques to study the wave energy distribution in 
the frequency domain. This analysis is related to several important characteristics in Oceanography, such as the significant 
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Figure 1: Wave characteristics 


wave height H s , (see section[4]for a definition), a standard measure of sea severity which can be obtained from the spectral 
distribution of the process. On the other hand, Gaussian processes provide tractable models for which it is possible to 
obtain explicit distributions of many parameters of interest, and are suitable models in many circumstances. They also 
provide a good first order approximation when nonlinearities are present. 

However, both hypotheses have limitations. Stationarity is only a valid hypothesis for short periods of time, while 
normality fails for shallow water waves or when nonlinearities are present. As a first application of the class of statistics 
proposed, we consider the problem of testing whether two samples of estimated spectral densities coming from (simulated) 
random wave processes have the same distribution. This is related to the problem of determining stationary intervals in 
the sea surface behavior. 

As regards the assumption of normality, the Gaussian model cannot account for observed asymmetries in real waves, 
a fact that has been known for a long time. According to |Borgmanj ( j 1972] ), ‘Gaussian models involving superposition of 
linear waves predict all the probability properties of the sea surface. Yet the commonly observed property that wave crests 
reach higher above mean water level than the troughs fall below cannot be encompassed within the model.’ 

In Gorrostieta et ak]( |20141 >, one-dimensional random waves from a North-Sea storm were considered from a functional 
point of view. A wave is defined as the trajectory of the sea-surface elevation between two consecutive downcrossings 
of the mean sea level (see Figure [TJ. The mean waves obtained after registration for a series of 20-minute intervals were 
considered and several features such as first and second derivatives, phase diagrams and their relation with the significant 
wave height of the corresponding 20-minute period were analyzed. Also, a comparison between real and simulated 
Gaussian waves was made. For this comparison, the spectral density for each 20-minute period was estimated, and a 
Gaussian process with the same sampling frequency as the original data was simulated from the spectral density. Mean 
waves for both cases were compared using a randomization conditional test and the results gave strong evidence that 
real and simulated waves follow different distributions. That study also gave evidence of the asymmetry of real waves, 
compared to simulated Gaussian waves. 

As further applications of the family of statistics developed in this work, we consider two problems associated to the 
analysis of random waves as functional data. The first concerns the effect that the amount of energy present in the sea 
surface, as measured through the spectral density of the waves, has on the shape of the waves. The second considers the 
asymmetry of real waves as compared to simulated Gaussian waves, and also explores the effect that energy may have on 
these differences. 

The rest of this article is structured as follows. In section [2] a family of statistics for the two-sample problem for 
functional data in introduced. Section [3] gives a CLT for these statistics and section [4] gives application examples to sea 
wave data. 
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2 A family of statistics for the two-sample problem on functional data 


Let X\ (/),... ,X m (t) and Y\ (t),...,Y n (t ) be two functional data sets for which the null hypothesis of equal distributions is 
to be evaluated. The X, and Yj are assumed to live in a space of functions, SC, on the interval J. Let g i,... ,gk be a finite 
set of functions in SC. The gj might have been estimated using the X and Y samples. For fixed g £ SC, consider the dot 
product functional on SC defined by 


Gg(x) = Jj x (t)g(t)dt. 


Let (X ) and Jz f(Y) denote, respectively, the probability laws that produce the X and Y samples. Then, the corresponding 
expected values of G g (X) and G g (Y) are 


S?(X)(G g ) = EGg(X) =E Jx{t)g(t)dt and Sf(Y)(G g ) = EG g (Y) = E J Y(t)g(t)dt 


(V) 


where X and Y are random functions with distributions SC(X) and Sf(Y), respectively. For each g, define the empirical 
processes with respect to each sample by 


v x (G g ) = -L ( £ {Gg(Xi )-EG g (X)) ) and v Y (G g ) = -L ( £ (G g (Yj)-EG g (Y)) 
v m \/< m J V" \j<n 

For each fc-tuple g = (gi, • ■ ■ ,gk) of functions in SC consider the empirical process vectors 

v x(G(g)) = (Vx (G gl Vx (G gt )Y and v y (G( g)) = (v Y {G gl ),...,v Y (G gk ))' 


( 8 ) 


(9) 


Assume that for the vector g considered, the covariance matrices of Vx{G(g )) and Vy(G(g)), say C(X,g) and C(Y. g), 
exist. For a given g, under the null hypothesis, the matrices C(X,g) and C(K,g) coincide. Write C(g) for their common 
value. The class of statistics that will be considered here are of the form: 


Q n = tl(m,ny (C(g)) 1 T){m,n), with 
= a(m,n)v x (G(g))-/3(»n,n)Vy(G(g)), 


( 10 ) 


where g = (Jq,... ,gk) is the (random) vector of functions mentioned above and C(g) is a natural estimator of the common 
covariance matrix for the empirical process vectors of (|9| evaluated on the functions of g. The numbers a(m.n) and 
P(m,n) are chosen in such a way that the expected values EG(Jq)(X) and EG(g ; )(T) that appear in (|8j, cancel out 
(under the null hypothesis) in the formula for r\{m,n), making unnecessary the estimation of means. The rationale for 
considering this type of statistics is, we believe, a natural one: Under the null hypothesis, the vectors Vx(G(g))/y/m and 
Vy(G(g)) / \Jn will converge to the same limit (zero) as the sample sizes increase, causing the quadratic form, Q n , to be 
bounded in probability (it will actually converge in distribution to a chi-square limit). Under the alternative, for properly 
chosen functions gj , there will be no cancelation of the means, the norm of T](ni.n) will diverge and, therefore, Q„ will 
go to infinity. 

A particular case of the statistic Q„ is the second method proposed in Horvath and Kokoszka ( |2012 Ch. 5, p. 67), 
where the functions in g are a subset of the principal components for the joint sample, although the presentation of the 
statistic, the assumptions made and, particularly, the methods of proof of properties differ from those in the present article. 


3 A Central Limit Theorem for the statistics proposed 

Note first that, by choosing 


yjn + m y/n + m 

a = a(m,n) = —— and p = p(m,n) = -—— 


( 11 ) 
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the formula in the second line of ([TO]) reduces to 


, . \/n +m „ . . \/n + m »-i „ , . 

77 = r\{m,n) = ^— £ G g (W) - £ G g (7y) 

i<m j<n 

making it unnecessary to compute (or estimate) the expectations for the calculation of r) . From here on, we will drop the 
subscripts and write a, fi and 77 , without specifying the sample sizes, unless necessary. The functionals G g are defined on 
the same underlying probability space of the X and Y functions on which they are applied. 

For the reader’s convenience, we now recall the definitions of “covering number” and “metric entropy”. Let & C 
L P (Q), for p = 1 or 2, and a probability measure Q on a probability space. For e > 0, the e-covering number of ■'¥ with 
respect to Q, N p {£,&,Q), is the minimum natural m such that there exist functions gi,g 2 ,---,g m G L P {Q) satisfying that, 
for every f £ there is a j £ {1,... ,m} such that \\f — gj\\p,Q < £ where || • \\ p q is the norm of L P (Q). H p (e, ^, Q) = 
log/V ;) (e, Qj is called the metric entropy of &. For details on metric entropy and related notions the reader can see 
Dudley ( | 1 987[ >, [Pollard ( 1 982[ >, Pollard (1984 1 , van der Vaart (1998 1 or van der Vaart and Wellner ( 1996| >. 

We have the following proposition. 

Proposition 1 Assume that the functions g used to define G g are taken from a class Sf C S£ C L 2 (J) (it will be convenient 
to assume that is included in ). Assume as well the following: 

(i) There is a real valued function F on J, such that, for all g £ Sf, and allt £ J, |g(f)| < F(t) and ||F|||y = fjF 2 (t)dt < °°. 

(ii) The random functions X(t) satisfy ||X H 2 j <M, for some positive constant M. 

(Hi) The collection satisfies Pollard’s entropy condition with respect to Lebesgue measure: 


y/logN 2 (e\\F\\,9,X)dE<> 


( 12 ) 


where A is Lebesgue measure on J. Then, the class of functionals 

J(f={G g :g£&} 

is bounded by a constant C: for all g £ & and X ~ Jf(X), |G g (X)| < C. Furthermore, the class JrfP satisfies Pollard’s 
uniform entropy condition: 

I ^\ogN 2 (Ce,Jff)de<™ (13) 

Jo 

where N 2 (C£, = sup eg, N 2 (Ce , J(f , J§f*) is a supremum over all probability measures ££* on the set of functions where 

the X(-) live. 

Proof: For each random function X on J, and G g £ Jf’, by the Cauchy-Schwarz inequality, we have 

\G g (X)\ < y JX\t)dtIg\t)dt < M jF\t)At. (14) 

by hypothesis. Next, let g\,g\, ■ ■ ■ ,g* n be a minimal set of functions such that, for every g £ there exists j < m for 
which ||g-g*|| 2 ./ < £■ Let Jzf* be a probability measure on . Then, 

^*(G g - G g *) 2 = (^j X(t)(g-g))(t)dt^ <M£ 2 (15) 

again by the Cauchy-Schwarz inequality, and independently of the particular f£*. It follows that, for an appropriate choice 
of a positive constant y, 

N 2 (C£^)<N 2 (£Y^, A) 
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and the result follows. 

Today, there exist many ways of establishing upper bounds for the metric entropy logA^(e||F||,Sf,A.) in Proposition 
1. For instance, the process of “registering” the functional data typically involves some degree of smoothing. When this 
is the case, the functions to be analyzed and compared, that is, the functions in , will be functions of bounded variation. 
Now, if Sf is a class of functions of variation bounded by a fixed constant D > 0, then logA f 2 (e||F’||,^,A) < Ke~ l , for 
some positive constant K (see Section 3 in van der Vaart ( [1996 1 ) and this is enough for condition ( [T2| to hold. On the 
other hand, in our first example in Section 4.1, the functions in are indicators of intervals, which form a VC-subgraph 
class of functions and, therefore, satisfy condition ( fl2| ) comfortably (see [Dudley] ( | 1987] > for the details). Proposition 1 
tells us that these metric entropy bounds will be inherited by the dot product functional class, a very convenient fact. 

As for the distribution of Q n in ( fT0| >, we have the following: 

Proposition 2 To the assumptions of Proposition 1 add the following: The functions in the vector g = (gi, ■ • •, gr), appear¬ 
ing in the definition ofQ„, converge in probability, in L 2 (J), to limiting functions (gi.oo, ■ ■ ■ ,gk.oc) such that, the covariance 
matrix of the limiting dot product functional vector, G( goo)(X) = (G ?1 ^(X),... ,G gkoo (X)), say C(X, goo), exists and is not 
singular. From the X-sample, assume that this covariance matrix is estimated as the sample covariance, C(X, g) of the 
vectors (G gl (X,-),..., Gg,, ( Xj)) for i < m and likewise, from the Y-sample, as C(Y, g). Suppose we use as estimator of the 
covariance matrix C(g) in (10), the appropriate multiple of the pooled covariance matrix: 


C(g) = 


or 


m + n — 2 


((m-l)C(X,g) + (n— l)C(T,g)). 


Then, Q n converges in distribution to a chi-square variable with k degrees of freedom. 

Proof: Under the null hypothesis of equality of distributions, the matrices C(X and C(Y .g,„) are the same. We are 
writing g„, for the vector of the g j ,OC, j < k and G(goo) for the corresponding vector of dot product functionals. Now, by 
Pollard’s uniform Entropy Condition that holds for 3f?, the Donsker property holds for the dot product class Jtf. This 
means that the empirical processes Vx{G( g)) and Vy (G(g)), both indexed in Sf, converge uniformly to a limiting Gaussian 
process and, by Dudley’s asymptotic equicontinuity condition and the assumed convergence of the functions in the vector 
g- 

Vx(G(g)) ^ Vx (G(g«,)) and, likewise, v F (G(g)) ^ Vy(G(g 00 )). 

Since the processes Vx(G(g)) and vy ( Gfg )) are independent, the quadratic form Q*, computed with the same formula of 
Q n , but using C(X ,g.fi as covariance matrix, will have, by the Continuous Mapping theorem, the chi-square distribution 
of the statement. Thus, by Slutzky’s theorem, it only remains to show that C(X,g) converges pointwise, in probability, to 
C(X, goo). But using inequalities ( 14 1 and ( |T5] >, it is easy to see that the covariance matrix C(X.g) is a continuous function 
of the vector g, with respect to the norm of lf(J). Thus, by the triangle inequality, it suffices to have a uniform law of 
large numbers for the class 

^ 2) ={G g Gf. g,f£&} (16) 


and for the class Jif as well. 
Proposition [T] we get 


Now, let ££* be a probability law on 3K and g,g 'functions in Sf. Then, using 


JX*\G g G f -G g ,G f ,\ 


< X*{\G g -G g ,\\G f \)+X*{\G f -G f \\G g \) 

< C(fi?*(\G g — G g '\)+fi?*(\G f — Gf\)), 


for the constant C in that Proposition. It follows that. 


N l (e,j(f* 2 '),sr) 
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and since the covering number A^(e/2C, iff) satisfies Pollard’s uniform entropy condition (13 i, the same will hold for 
sup^ftN i(£,J$?' 2 \J?*) (squaring the covering number does not affect the entropy condition), and this is more than 
enough for a Uniform Law of Large Numbers for The argument for ,W J is simpler and omitted, and the proof of 

Proposition [2]is complete. 


4 Performance evaluation on examples 

This section describes the application of the methodology presented in Sections[2]and[3]on three problems from the field 
of Oceanography. The first application is related to the comparison of spectral densities, while the other two examples are 
related to the analysis of the shape of waves. 

4.1 Comparison of spectral densities 

As was mentioned in the Introduction, a frequent model for the sea surface elevation at a fixed point is a centered stationary 
Gaussian random process X(t). The covariance r(h) = E(X(t)X(t + h)) of this process has a spectral representation given 
by 

r(h) = J e lhC 0 s(co)dco , 

where the function .?(•) is known as the spectral density. However, the stationarity hypothesis is not valid in the middle or 
long term, and the use of stationary models is limited in time, depending on the weather conditions at the place of study. 
An interesting and important problem is that of determining the duration and characteristics of the stationary intervals for 
these processes, and one possible point of view for this problem is the analysis of the spectral densities estimated during 
short periods of time. 

Sea surface elevation data frequently come from moored buoys and the sampling frequency is usually between 1 and 2 
Hz. Data are stored in 20 or 30-minute intervals, which are considered to be short enough for the stationarity assumption 
to hold, but long enough to have a good estimation of the spectral density. Using this information, a possible approach 
for the stationarity problem is to estimate the spectral density for each time interval. Using the techniques developed in 
the previous sections, one can compare, as we shall see, the estimated spectral densities to determine whether they come 
from the same distribution or not. If they do, and they are contiguous in time, they correspond to a stationary period in 
the wave data. 

A simulation study was carried out in this context, in order to compare spectral densities, using the Matlab toolbox 
WAFO ( Brodtkorb et al.[ 2QQ0( > and spectra from the Torsethaugen parametric family. This is a set of bimodal spectral 
densities of frequent use in Oceanography, which account for the simultaneous presence of wind-generated waves and 
swell, and was developed to model spectra observed in the North-Sea. More details can be found in Torsethaugen ( |1993 1 
and [Tors ethau gen and Haver| ( |2004| l . 

The parameters for the Torsethaugen family are the significant wave height H s and the spectral peak period T p . The 
significant wave height is a standard measure of sea severity and is defined as H s = 4a, where a 2 , the variance of the 
process, is the integral of the spectral density s: 


<y 2 = s(co)dco. 


The spectral peak period is the inverse of the modal (peak) frequency of the spectral density. 

The simulation scheme was as follows: Two spectral densities were chosen from the parametric family. The param- 

= 4.1. Figure 4.1 (left) shows the corresponding spectral 


eters were set at H s = 2 in both cases and T p = 4.0 and T p 


densities. From these densities and using the WAFO toolbox, stationary Gaussian random (wave) processes lasting 30 
minutes were simulated, with a sampling frequency of 1.28 Hz., i.e., the time interval between two consecutive points is 
0.78125 seconds. These simulations correspond to what would have been observed using a moored buoy. 
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Figure 2: Torsethaugen spectra (left) and estimated spectral densities for T p = 4.0 (center) and Tp = 4.1 (right). 

From each simulation, the spectral density was estimated using a Parzen window with length 60. This was repeated 
10 times for each of the two original spectral densities, yielding two independent samples of 10 functions each, which 
come from different populations. Due to the random variations in the simulation and estimation stages, these curves are 
similar in shape but present important variations. Figures [44] (center) and (right) show the two samples. These densities 
were represented using a b-spline basis of order 5 with 51 nodes in the interval [0, Tz\. 

For testing whether the two samples come from the same distribution, two versions of the Q„ statistic were considered. 
For the first one, the range of frequencies [0, n] was divided into 8 intervals and the indicator functions of these intervals 
play the role of the g functions. In this case, the score corresponding to a projection along the direction of one of these 
indicator functions is equivalent to integrating the energy for the range of frequencies represented by the interval. For 
the second version, a b-spline basis of order 5 with 7 interior nodes was used as g functions. The values obtained for Q n 
for these two versions of the statistic were 39.105 and 120.18, respectively, with corresponding /;-values, with respect to 
the asymptotic distribution, of 4.7 x 10“ 6 and 0. Nevertheless, taking into account the small sample sizes, asymptotic 
p-values cannot be considered valid and we must resort to Monte Carlo p-values. For this purpose, we use the fact that an 
algorithm is available for the generation of independent samples with a given spectral density. From the “original” set of 
20 estimated spectral densities (10 for each parameter choice) an average spectral density was estimated, say savg- Then, 
.s'avg was used to produce two sets of 10 simulated stationary Gaussian random (wave) processes lasting 30 minutes, using 
the WAFO toolbox, as described above. From each 30 minute simulated wave process, the corresponding spectral density 
was estimated, to produce two sets of 10 spectral densities under the null hypothesis. On these two samples of spectral 
densities the statistic Q„ was computed. The simulation procedure just described was repeated 10,000 times, using always 
the same savg. and the 10,000 values of Q„ produced were used to estimate the p-value for the original value of the 
statistic. The resulting Monte Carlo p-values were 0.0492 and 0.0398, for the two schemes of g functions considered, 
which shows that even for the small sample sizes considered and for two similar Torsethaugen spectral densities, the 
method proposed is able to produce some evidence of difference. 

Next, we performed the same simulation experiment, with larger sample sizes, in order to assess speed of convergence 
to the limiting distribution. In this case we considered a sample size of 140 estimated spectral densities for T p = 4.0 and 
160 for T p = 4.1. The Q„ values for the sample were 507.03 for the indicator basis and 517.16 for the b-spline basis, 
with p-values equal to 0 in both cases. The Monte Carlo procedure is identical to the one described above for the smaller 
sample sizes and, in this case, from the Monte Carlo simulations, approximate quantiles were estimated, along with the 
Monte Carlo p-values. The results, regarding quantiles, for both versions of Q n , are given in table |T| 

The results in Table[l]show that the relative errors, between Monte Carlo and asymptotic quantiles, vary between 1.3 
and 6.5% and are always negative, indicating that the asymptotic values underestimate the true quantiles of the statistic in 
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Spline basis 


Quantile 

0.5 

0.9 

0.95 

0.975 

0.99 

Asymptotic 

11.34 

18.549 

21.026 

23.337 

26.217 

MC 

11.857 

19.451 

22.242 

24.846 

27.445 

Rel. error 

-0.046 

-0.049 

-0.058 

-0.065 

-0.047 


Indicator basis 


Quantile 

0.5 

0.9 

0.95 

0.975 

0.99 

Asymptotic 

7.344 

13.362 

15.507 

17.535 

20.09 

MC 

7.443 

13.819 

16.064 

18.202 

20.892 

Rel. error 

-0.013 

-0.034 

-0.036 

-0.038 

-0.040 


Table 1: Finite sample and limiting quantiles for the spectral density data. 


this case. Thus, for this problem and the choices of functions g made for Q n , we conclude that the convergence to the limit 
distribution of the statistic is slow and it is advisable to calculate the necessary p-values using a Monte Carlo procedure. 

4.2 Waves as functional data 

The other two examples in this section concern the analysis of waves as real functions. The raw data consists of sea surface 
elevation measurements at a fixed point obtained from the U.S. Coastal Data Information Program (CDIP) website. The 
data come from buoy 106 (51201 for the National Data Buoy Center), a station located at Waimea Bay, Hawaii, at a sea 
depth of 200 meters. The surface elevation was sampled at a frequency of 1.28 Hz, during 30-minute intervals. A total of 
430 intervals (8 days and 23 hours), between January 1 st and January 9^, 2003, were considered. 

A wave is defined as the curve of surface elevation values between two consecutive downcrossings of the mean sea 
level (see Figure|T]». For each 30-minute interval, the individual waves were considered as functions. Since the time length 
of each individual wave (the period) is different, all waves were registered to the [0,1] interval by a linear transformation 
of the time interval. After registration, waves were initially represented using a B-spline basis of order 6 with nodes at the 
data points that define each wave. Then, these functions are represented using a common basis, again B-splines of order 
6 , but with 61 equidistant nodes on the interval [0,1], so that all waves have a representation in terms of a common basis. 
The order of the splines guarantees that the functions are smooth, having two continuous derivatives. 

Spectral densities were estimated for each 30-minute interval using the toolbox WAFO and the values of a 2 and H s 
were obtained for each data interval. Figure [3] shows both the original sea surface elevation data and the evolution of H s 
in time. 

For the purpose of evaluation of the methodology proposed here, the 30-minute intervals were divided into four groups, 
G1,G2,G3 and G4, according to the value of their significant wave height; the groups correspond to values in the ranges 
0 — 2 m., 2 — 4 m., 4 — 6 m. and values over 6 meters for G4. 

Within each energy group, two consecutive 30-minute intervals were selected, and the waves corresponding to those 
intervals constitute the data set for the group. The selected sets of waves are indicated in Figure [T2] (left) and are denoted 
in what follows as H s 1, H s 2, H s 3 and H s 4, with significant wave height increasing with numbering. The number of waves 
in each one-hour set are, respectively, 166, 171, 179 and 187. Figure [4~2] (right) shows the waves in the sets H s 1 to H s 4. 
It is clear that, in terms of amplitude, the waves in these groups are different, with the possible exception of H s 3 versus 
H s 4. We wish to quantify these differences with the methodology proposed in the previous section. We will compare, in 
the context of the two-sample problem, H s 1 versus H s 2, H s 2 versus H s 3 and H s 3 versus H s 4. 
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Figure 3: Wave height (top) and significant wave height (bottom) for Buoy 106. 
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Figure 4: Position of selected intervals (left), waves in the selected time intervals (right). 


4.3 Projection on odd and even trigonometric functions 


For the problem of comparing the different data sets described in section 4.2 we apply the statistic Q n , using two functions 
in the vector g, that will be certain projections of the joint data set on linear combinations of the odd and even trigonometric 
functions with coefficients determined from the joint sample of functions. For each registered wave, Z,-(f), in the joint 
sample, we consider its /-th sine and cosine Fourier coefficients, given by 


an = [ Zj(t) sin(27r/f)dr and bn = [ Z,(f)cos(27r/f)dr. (17) 

J o Jo 

We compute these coefficients for / < k = 3, since the coefficients decrease very rapidly. For each 1 < k, we take the 
averages of the absolute values of the an and bn as representatives of the relevance of the /-th term in the expansion: 


1 N i N 

M and for 1 = 1,2 and 3, 


( 18 ) 


where, as before, N = m + n is the size of the joint sample. Then, we take as our functions, g\ and g 2 , the following 

3 3 

gi = J2d/sin(27T/f) and gi = V bicos(2nlt). (19) 

i=\ i =l 
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Figure 5: g\ (solid), go (dashed) and sine function (dotted) for the H s \ vs. H s 2 test. 


These functions are calculated for each pair of energy levels: H s 1 versus H s 2, H s 2 versus H s 3 and // v 3 versus // v 4. As a 
reference. Table [2] shows the values of the coefficients that define g i and go in the case of H s 1 versus H s 2 and Figure [5] 
shows the corresponding g\ and go functions plus a sine function for comparison purposes. Table[3]shows the values and 
p-values obtained when Q„ is calculated for testing the difference of distribution between the groups of waves of different 
energy. Note that this time we evaluate the value obtained for Q ,, against the chi-square distribution with 2 degrees of 
freedom. 

The numbers in Table [3] reflect very strong evidence against the null hypothesis in all cases, especially in the first 
two, as could be expected from the waveforms in Figure [4] (right), and these results also say that the projections on the 
trigonometric functions are enough, through the Q n statistic, for detecting the difference in energy between the samples. 


j 

a j 

b i 

1 

0.916 

0.151 

2 

0.097 

0.097 

3 

0.024 

0.029 


Table 2: Coefficients defining gi and go for the H s 1 vs. H s 2 test. 


Pair of samples tested 

Q n value 

p-value 

H s 1 versus H s 2 

H s 2 versus H s 3 

H s 3 versus H s 4 

103.75 

107.37 

17.01 

0 

0 

2.02 x 10~ 4 


Table 3: Q n values for projections on odd and even trigonometric functions. 

This result, however, is to be expected from the differences in amplitude that can be observed in Figure[4] So a natural 
question is whether the dissimilarities are only in amplitude, or whether there are also differences in the shape of the waves 
due to the variation in energy levels. To test if there are differences between these samples other than in amplitude, the 
normalized waves were considered, where the normalization was obtained dividing by the standard deviation estimated 
for each one-hour interval. We consider intervals H s l,H s 2 and H s 3 and Figure[6]shows the registered waves for the three 
possible pairs of normalized samples. The differences among them, if there are any, are not so obvious now. 

Using the same method as before, the three pairs of samples were compared to test whether the curves come from 
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Figure 6: Waves in intervals H s 1 and H s 2 (left), H s 1 and H s 3 (center) and H s 2 and H s 3 (right). 


the same distribution. The results of these tests are given in table [4} where the p-values obtained using the asymptotic 
distribution and a bootstrap procedure, described below, are included. 


Pah' of samples tested 

Q n value 

p-value (asymp.) 

p-value (bootstrap) 

H s 1 versus H s 2 

17.15 

1.88 x 10~ 4 

4 x 10~ 4 

H s 2 versus H s 3 

1.023 

0.5997 

0.5997 

H s 3 versus H s 4 

2.258 

0.323 

0.333 


Table 4: Q n and p-values based on principal components for the normalized samples. 

These values show that the differences are not so clear after normalization, and point to the first interval H s 1 being 
different from the other three, but there is no evidence of differences between H s 2 and H s 3 or H s 3 and H s 4. Figure [7] 
shows the normalized spectra for these intervals, and may help explain the results obtained, since the spectral density 
of a time series sums up its oscillatory behavior. As can be seen, the spectra for intervals H s 2, H s 3 and H s 4 are similar 
in dominant frequency and dispersion while the spectrum for H s 1 is clearly different in both aspects. It is important to 
observe, however, that the process of registration of the individual waves to a common interval losses the information 
about the period of the wave, and hence also about frequency. Thus it seems likely that it is the dispersion (and shape) of 
the spectral density rather than its location in the frequency scale which accounts for the differences observed in the three 
samples. Nevertheless, the relationship between spectral densities and the shape of waves is not clear and requires further 
exploration. 

Next, we evaluate whether, in this example, the asymptotic distribution as a reference is valid for the sample sizes 
considered. Thus, our next experiment evaluates, through a bootstrap procedure, the approximation to the null distribution 
in the present context. The bootstrap method used here does not require the estimation of spectral densities and, for this 
reason, is computationally significantly less expensive than the Monte Carlo procedure described before. 

For this purpose, the 166 waves of data set H s 1 were used. The data were randomly split into two sets of 106 and 60 
waves, respectively, and the Q n statistic was computed. This procedure was repeated for 10,000 random selections of the 
two subsets, of 106 and 60 waves (from the same joint sample of 166), calculating Q n every time. From the 10,000 values, 
we obtain quantiles of Q„ that correspond, approximately, to the null hypothesis and are displayed in Table [5] were we 
have included, for comparison purposes, the corresponding quantiles for the Xo distribution. 

The good agreement between finite sample and limiting quantiles in Table [5] suggest that for sample sizes above a 
hundred for both samples, the proposed statistic can be confidently used for the type of data considered in this example. 
This form of bootstrap was used to produce the “bootstrap” quantiles in Table |4]by bootstrapping from the joint dataset in 
each case. 
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Figure 7: Spectral densities for intervals H s 1. H s 2 and H s 3. 



.5 

.9 

Probabilities 

.95 

.975 

.99 

asymptotic 

1.386 

4.605 

5.992 

7.378 

9.21 

bootstrap 

1.365 

4.662 

6.147 

7.454 

9.169 

relative error 

0.0157 

-0.0124 

-0.0259 

-0.0103 

0.0045 


Table 5: Finite sample and limiting quantiles of Q n for H s 1 data. 


4.4 Asymmetry of real waves 


One of the advantages of the set of statistics proposed in this work is its flexibility. In this section we show how it is 
possible to construct statistics suitable for the assessment of symmetry in samples of functions. In Gorrostieta et ah 
d20l4| , sets of registered storm waves were evaluated for asymmetry and also compared, by means of a conditional 
permutation test, to sets of waves generated from the Gaussian model with parameters estimated from the data. The 
Gaussian model, as described for instance in |Ochi| ( |1998[ ), is a standard stochastic model for sea waves. Still, it was 
pointed out in the introduction that real waves differ from those produced by the model, in that real waves present more 
asymmetry than the model would allow, having shallower troughs and more peaked crests, and this difference may be 
more marked at higher energy levels. 

We will now use the proposed statistic Q n to test for the null hypothesis that registered real waves and waves produced 
by the Gaussian model have the same distribution against the alternative that real waves show more asymmetry. For 
this purpose, we take a set of waves corresponding to two consecutive 30 minute period in each of the four energy 
levels considered in Section 4.2. For this analysis, in the registration process of the waves an added restriction was that 
the upcrossing of the mean level occurs at 0.5. The precise description of the registration procedure can be found in 
Gorrostieta et al.]( |2014[ >. From each dataset, the spectral density is estimated and from it, a set of simulated waves is 
produced, using the WAFO toolbox of the MATLAB language. For each energy level, from the combined sample (real 
and simulated waves), we compute, for I < 3 and i < N, the coefficients an, the average a/ and the function gi of C3- 
(fl8]> and (f]~9|>. The idea is the following: The suspected asymmetry of real waves consists on the waves being less deep. 


13 





























” 




A 





-- 


mk 

2 



JB| 


i&IL 

\ ■/ 

g s- 

t / * 

S -- 


B -- 


-lgf 

7 

l§f 

T 

W 

7 - 


fll 


l|i' 



7 



T 


7 


7 




Figure 8: Real and simulated waves for the four groups considered. 


in terms of amplitude, during the first half cycle, than the amount they rise above sea level on the second half cycle. This 
type of asymmetry should show in the dot product against a relevant odd function. Thus, we estimate a representative odd 
function gi and apply the statistic Q n with that function alone. If the alternative hypothesis holds, we expect that, for the 
real waves, the dot product with g\ will have a positive mean value, while it will tend to zero for the simulated waves. 

Figure[8]shows the registered real waves and simulated waves considered for this symmetry test in groups H s 1 to H s 4. 
At first sight, in none of the cases is the asymmetry of the real waves evident. 

Application of the procedure described above to the selected wave samples produced the results presented in Table 
[6] In this analysis the p-values are computed with the chi-square distribution with one degree of freedom, since a single 
function is used in the vector g. At the first two energy levels, we find no evidence of the asymmetry, and in this sense, 
the mathematical model used seems to be producing waves similar to the real ones. At the higher energy levels, H s 3 and 
H s 4 the value of the statistic is significant at the 5% level, being much stronger the evidence for the H s 4 data. This results 
seem in agreement with what has been concluded in the literature by other means. 


Energy level 

Sample sizes (real | simulated) 

Q n value 

/;-value 

H s 1 

85 

78 

2.69 

0.101 

H s 2 

104 

156 

0.31 

0.578 

H s 3 

136 

127 

4.87 

0.027 

HA 

151 

121 

51.58 

6.9 x 10~ 13 


Table 6: Q„ values for asymmetry test 

As a final remark, we conclude that the proposed methodology is a flexible tool that can be used to test the validity of 
different hypothesis on functional data sets. 
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